Estimating PM2.5 utilizing multiple linear regression and ANN techniques

The accurate prediction of air pollutants, particularly Particulate Matter (PM), is critical to support effective and persuasive air quality management. Numerous variables influence the prediction of PM, and it's crucial to combine the most relevant input variables to ensure the most dependable predictions. This study aims to address this issue by utilizing correlation coefficients to select the most pertinent input and output variables for an air pollution model. In this work, PM2.5 concentration is estimated by employing concentrations of sulfur dioxide, nitrogen dioxide, and PM10 found in the air through the application of Artificial Neural Networks (ANNs). The proposed approach involves the comparison of three ANN models: one trained with the Levenberg–Marquardt algorithm (LM-ANN), another with the Bayesian Regularization algorithm (BR-ANN), and a third with the Scaled Conjugate Gradient algorithm (SCG-ANN). The findings revealed that the LM-ANN model outperforms the other two models and even surpasses the Multiple Linear Regression method. The LM-ANN model yields a higher R2 value of 0.8164 and a lower RMSE value of 9.5223.

Furthermore, investigating every parameter that contributes to air pollution is an arduous task. To address this, air pollution models are needed to support early warnings and control actions, and to evaluate future emission scenarios [17,18]. The increasing role of machine learning in air quality prediction represents a significant step forward in our ability to monitor and manage environmental health. Machine learning techniques allow forecasters to draw on large amounts of data, including historical air quality information, meteorological data, and even satellite imagery. These algorithms can identify complex patterns and correlations within the data, enabling more accurate predictions of air quality parameters such as PM concentrations, ozone levels, and other pollutant concentrations. By providing real-time, high-resolution forecasts, machine learning models help policymakers, environmental agencies, and the public to make informed decisions, take preventive measures, and mitigate the adverse effects of air pollution on public health and the environment. The growing integration of machine learning into air quality prediction is a promising avenue for advancing the understanding of air pollution dynamics and improving the quality of life for communities around the world. Among the various statistical procedures, Artificial Neural Networks (ANNs) have proven effective at capturing complex relationships and improving forecast accuracy [19-24].
ANNs are computational models inspired by the structure and function of the human brain. They consist of interconnected nodes, or artificial neurons, organized into layers. These networks are used for various machine learning tasks, including pattern recognition, classification, regression, and more complex tasks such as natural language processing and image recognition. In an ANN, information flows through the network, with each neuron processing data and transmitting it to the next layer. ANNs have gained widespread popularity due to their ability to handle complex and high-dimensional data, making them a crucial component of modern artificial intelligence and deep learning applications. They have been instrumental in advancing fields such as computer vision, speech recognition, and autonomous systems, among many others. Broadly, the main kinds of ANN include the back-propagation neural network [25,26], the multilayer perceptron [27,28], the radial basis function network [29,30], and the adaptive neuro-fuzzy inference system [31,32].
The primary objective of this study is to assess the performance of ANNs trained with different algorithms for predicting PM2.5 concentration. Additionally, we conduct a comparative analysis with the traditional multiple linear regression (MLR) model.

Background
Numerous researchers have evaluated ANN-based air quality prediction models, with particular emphasis on the accuracy of PM concentration predictions across a wide range of scenarios. ANNs have been used to predict hourly and daily average PM concentrations from air pollutant and atmospheric data [33,34]. In Santiago, Chile, Perez et al. [35] estimated hourly average PM2.5 concentrations several hours in advance from values measured at a fixed site. The ANN predictions had errors in the range of 30-60%, and the authors found that reducing noise in the dataset was essential for improving the predictions. McKendry [36] compared the ANN technique with classical regression techniques for PM10 and PM2.5 estimation and established that meteorological variables, persistence, and co-pollutant values effectively estimated PM levels. In another investigation, Chelani et al. [37] developed an ANN procedure to predict PM10 and toxic metal concentrations in Jaipur, India, and estimated them reasonably well. Tecer [38] proposed ANNs to estimate PM levels in Zonguldak Province, Turkey; the results showed that the technique can be employed effectively to estimate air quality. Pires et al. [39] evaluated the performance of five linear models for estimating daily average PM10 levels and confirmed that the size of the dataset is an important factor in model performance. Paschalidou et al. [40] employed a multilayer perceptron (MLP) to predict hourly PM10 levels in Cyprus and found that the MLP models gave the best estimation performance. Roy et al. [41] used both multiple regression and ANN techniques to analyze PM levels in different seasons at a large opencast coal mine in India; the ANN-based forecasts outperformed the multiple regression models.
Kurt & Oktay [42] proposed an online ANN-based air pollutant forecasting system for the Besiktas district of Istanbul that uses parameters obtained through geographic modeling. The system takes meteorological parameters, air pollutant levels, and certain area-specific attributes as inputs. The ANN technique was used in this study to develop a PM2.5 concentration prediction model. In Spain, another ANN model was proposed to estimate 24 h average PM10 levels, using deterministic variables for the overall transport of aerosols from arid areas [43]. An innovative approach was employed to forecast PM2.5 and PM10 levels in major Chinese cities. It integrated a feedforward ANN model with a rolling mechanism to capture input data patterns and with the accumulated generating operation of the gray model to reduce the randomness of the data samples [44]. The prediction relied mainly on daily PM2.5 and PM10 values and on a few atmospheric parameters. Another study, conducted in Yasuj city, analyzed the health impact of exposure to PM10 and estimated PM10 levels using ANN [45]; it used daily average PM10 values together with climatic data. In general, among machine learning approaches, ANN has been the one most favored by researchers. The present study examines the ANN technique with varying training functions to establish the most effective model for PM2.5 estimation.

Methodology

Study area and air quality data
In India, the Central Pollution Control Board (CPCB) is the apex institution that investigates and monitors air quality. It supervises air pollution through its many monitoring stations, which extend to nearly every city [46]. This study uses the combined manual and continuous monitoring data for the year 2021, comprising the annual average values of SO2, NO2, PM10, and PM2.5 (in μg/m3), as shown in Fig. 1.

Modeling and selecting suitable input variables
The observed levels of the air pollutants PM10, PM2.5, NO2, and SO2 were investigated with the objective of building an air pollution estimation model. The dataset was sourced from the CPCB for the year 2021. Figure 2 shows the relationships among the SO2, NO2, PM10, and PM2.5 levels. We observed a positive correlation among all these variables, signifying their relevance to the study. Notably, PM2.5 had correlation coefficients of 0.31 with SO2, 0.61 with NO2, and 0.83 with PM10. Consequently, SO2, NO2, and PM10 were selected as the input variables for the PM2.5 estimation model.
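The correlation-based screening described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the data are synthetic stand-ins rather than the CPCB measurements, and the helper names and the `threshold` parameter are hypothetical.

```python
import numpy as np

def correlations_with_target(data, target):
    """Pearson correlation of every other variable with the target variable."""
    return {name: float(np.corrcoef(values, data[target])[0, 1])
            for name, values in data.items() if name != target}

def select_inputs(corrs, threshold=0.0):
    """Keep variables whose correlation with the target exceeds the threshold.

    The threshold is a hypothetical knob; the text only states that all
    positively correlated pollutants were kept.
    """
    return [name for name, r in corrs.items() if r > threshold]

# Synthetic stand-in data (NOT the CPCB measurements)
rng = np.random.default_rng(0)
pm10 = rng.uniform(40, 250, size=200)
no2 = rng.uniform(5, 60, size=200)
so2 = rng.uniform(2, 30, size=200)
pm25 = 0.45 * pm10 + 0.3 * no2 + 0.1 * so2 + rng.normal(0, 10, size=200)

data = {"SO2": so2, "NO2": no2, "PM10": pm10, "PM2.5": pm25}
corrs = correlations_with_target(data, "PM2.5")
selected = select_inputs(corrs)
print(corrs)
print(selected)
```

With real data, the same two calls would be applied to the measured pollutant columns; the numbers printed here reflect only the synthetic generator.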

Multiple linear regression (MLR) model
Multiple linear regression (MLR) is a statistical technique used to assess the relationship between a single dependent variable and two or more independent variables. The method fits a linear equation to the data, with coefficients representing the contribution of each independent variable to the dependent variable. The fit is chosen to minimize the sum of the squared differences between the observed and predicted values. MLR is a valuable tool for uncovering associations and understanding the underlying factors that influence a particular phenomenon. The formula for expressing the dependent variable y in terms of the independent variables x1, x2, …, xn is as follows:

$$ y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \dots + \alpha_n x_n + \varepsilon $$

where n = number of independent variables, α0 = the y-intercept, αi = coefficient of the independent variable xi (i = 1, …, n), and ε = model error.
In this investigation, PM2.5 is taken as the dependent variable, while SO2, NO2, and PM10 are the independent variables. The MLR model computes the intercept α0 and the coefficients α1, α2, …, αn using the least squares method.
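As an illustration of the least squares fit, the sketch below estimates the intercept and coefficients with NumPy. It assumes synthetic data with known coefficients (not the CPCB dataset), and the function names `fit_mlr` and `predict_mlr` are hypothetical.

```python
import numpy as np

def fit_mlr(X, y):
    """Return [alpha_0, alpha_1, ..., alpha_n] from ordinary least squares."""
    A = np.column_stack([np.ones(len(X)), X])   # column of ones for the intercept
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def predict_mlr(coeffs, X):
    """Evaluate y = alpha_0 + alpha_1*x_1 + ... + alpha_n*x_n."""
    return coeffs[0] + X @ coeffs[1:]

# Synthetic stand-in data; columns play the role of SO2, NO2, PM10
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(150, 3))
true_coeffs = np.array([5.0, 0.1, 0.3, 0.45])   # assumed values for the demo
y = true_coeffs[0] + X @ true_coeffs[1:] + rng.normal(0, 1.0, size=150)

coeffs = fit_mlr(X, y)
y_hat = predict_mlr(coeffs, X)
print(coeffs)
```

Because the synthetic noise is small, the fitted coefficients land close to the assumed ones, which makes the sketch easy to sanity-check.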

Proposed ANN models
In the present analysis, the nftool of MATLAB (Version R2014b) was used and executed on a system equipped with an Intel HD Graphics card, a 17-inch display, 4 GB of memory, an Intel 11th generation i5 430 M processor, and a 512 GB SSD [47-49]. The ANNs were trained using NO2, SO2, and PM10 as input variables and PM2.5 as the target variable. The neural network architecture consisted of 20 neurons in the hidden layer, as depicted in Fig. 3. For the initial training phase, the dataset was partitioned into training (70%), validation (10%), and testing (20%) subsets. ANNs employ various training algorithms to adjust the network's parameters (weights and biases) in order to minimize errors and improve performance. In this study, the network was trained with three distinct algorithms in turn: the Levenberg-Marquardt, Bayesian Regularization, and Scaled Conjugate Gradient training algorithms.
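The setup above (three inputs, one hidden layer of 20 neurons, a 70/10/20 split, Levenberg-Marquardt training) can be sketched in NumPy as follows. This is a simplified stand-in, not the MATLAB nftool configuration used in the study: the data are synthetic, the tanh hidden layer, damping schedule, iteration count, and all function names are assumptions, and the validation subset is only scored rather than used for early stopping.

```python
import numpy as np

N_IN, N_HID = 3, 20                       # 3 inputs, 20 hidden neurons, 1 output
N_PARAMS = N_IN * N_HID + 2 * N_HID + 1   # all weights and biases

def forward_and_jacobian(w, X):
    """Network output and its Jacobian with respect to the parameter vector w."""
    W1 = w[:N_IN * N_HID].reshape(N_IN, N_HID)
    b1 = w[N_IN * N_HID:N_IN * N_HID + N_HID]
    W2 = w[N_IN * N_HID + N_HID:-1]
    b2 = w[-1]
    H = np.tanh(X @ W1 + b1)              # hidden activations, shape (N, N_HID)
    pred = H @ W2 + b2                    # linear output layer, shape (N,)
    dZ = (1.0 - H ** 2) * W2              # d pred / d hidden pre-activation
    J_W1 = (X[:, :, None] * dZ[:, None, :]).reshape(len(X), -1)
    J = np.hstack([J_W1, dZ, H, np.ones((len(X), 1))])
    return pred, J

def train_lm(X, y, n_iter=50, lam=1e-2, seed=0):
    """Levenberg-Marquardt: damped Gauss-Newton steps on the sum of squared errors."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=N_PARAMS)
    pred, J = forward_and_jacobian(w, X)
    sse = float(np.sum((y - pred) ** 2))
    history = [sse]
    for _ in range(n_iter):
        residual = y - pred
        step = np.linalg.solve(J.T @ J + lam * np.eye(N_PARAMS), J.T @ residual)
        w_new = w + step
        pred_new, J_new = forward_and_jacobian(w_new, X)
        sse_new = float(np.sum((y - pred_new) ** 2))
        if sse_new < sse:                 # accept the step and reduce damping
            w, pred, J, sse = w_new, pred_new, J_new, sse_new
            lam = max(lam / 10.0, 1e-6)
        else:                             # reject the step and increase damping
            lam = min(lam * 10.0, 1e8)
        history.append(sse)
    return w, history

# Synthetic stand-in data (NOT the CPCB measurements); columns: SO2, NO2, PM10
rng = np.random.default_rng(42)
X_raw = rng.uniform([2.0, 5.0, 40.0], [30.0, 60.0, 250.0], size=(300, 3))
y_raw = (0.1 * X_raw[:, 0] + 0.3 * X_raw[:, 1] + 0.45 * X_raw[:, 2]
         + 5.0 * np.sin(X_raw[:, 2] / 40.0) + rng.normal(0.0, 5.0, size=300))

# 70% / 10% / 20% split into training, validation, and testing subsets
perm = rng.permutation(len(X_raw))
n_train, n_val = 70 * len(perm) // 100, 10 * len(perm) // 100
train_idx, val_idx, test_idx = np.split(perm, [n_train, n_train + n_val])

# Standardize inputs and target with training-set statistics only
x_mu, x_sd = X_raw[train_idx].mean(axis=0), X_raw[train_idx].std(axis=0)
y_mu, y_sd = y_raw[train_idx].mean(), y_raw[train_idx].std()
X, y = (X_raw - x_mu) / x_sd, (y_raw - y_mu) / y_sd

w, history = train_lm(X[train_idx], y[train_idx])

def rmse_original_units(idx):
    """RMSE on a subset, converted back to the units of the raw target."""
    pred, _ = forward_and_jacobian(w, X[idx])
    return float(np.sqrt(np.mean((pred - y[idx]) ** 2)) * y_sd)

print("validation RMSE:", rmse_original_units(val_idx))
print("test RMSE:", rmse_original_units(test_idx))
```

Each step is accepted only if it lowers the training error, so the recorded error history never increases. Bayesian Regularization and Scaled Conjugate Gradient would replace `train_lm` with a different update rule while keeping the same network and split.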
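Performance metrics. The MLR model and the three proposed Artificial Neural Network (ANN) models are assessed and compared using the Root Mean Square Error (RMSE) and the Coefficient of Determination (R2). These metrics are defined as follows, with R2 given in its standard form:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\bigl(o(t) - p(t)\bigr)^2} $$

$$ R^2 = 1 - \frac{\sum_{t=1}^{n}\bigl(o(t) - p(t)\bigr)^2}{\sum_{t=1}^{n}\bigl(o(t) - \bar{o}\bigr)^2} $$

where n = number of observations, o(t) = actual value of the variable, p(t) = predicted value of the variable, and $\bar{o}$ = mean of the actual values.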
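The two metrics can be computed as in the sketch below. It assumes the standard definitions, RMSE as the square root of the mean squared error and R2 as one minus the ratio of residual to total sum of squares; the function names and the four-point example are illustrative only.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean square error between observed o(t) and predicted p(t)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((o - p) ** 2)))

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (standard form assumed)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    ss_res = np.sum((o - p) ** 2)
    ss_tot = np.sum((o - o.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Tiny made-up example, not taken from the study
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

A lower RMSE and an R2 closer to 1 both indicate a better fit, which is how the models are ranked in the results.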

Simulation results and discussion
This study estimates PM2.5 concentrations using the 2021 annual average data of SO2, NO2, and PM10 as input parameters. Model performance was evaluated with the Root Mean Square Error (RMSE) and the coefficient of determination (R2). These metrics were used to compare the MLR model with the three ANN models: ANN trained using the Levenberg-Marquardt algorithm (LM-ANN), the Bayesian Regularization algorithm (BR-ANN), and the Scaled Conjugate Gradient algorithm (SCG-ANN). The resulting RMSE and R2 values are presented in Table 1.
The results revealed that the LM-ANN model outperformed the others, yielding the lowest RMSE of 9.5223, compared to 9.6555, 11.0165, and 11.7585 for BR-ANN, SCG-ANN, and MLR, respectively. Furthermore, the PM2.5 concentration estimated by the LM-ANN model correlates strongly with the observed values, with an R2 of 0.8164. In contrast, BR-ANN has an R2 of 0.8118, SCG-ANN 0.7551, and MLR 0.7201.
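To put the reported figures side by side, the short sketch below ranks the four models by RMSE and expresses each ANN's RMSE as a percentage reduction relative to MLR. The RMSE and R2 values are those quoted in the text; the percentage reduction is a derived quantity added here for illustration and is not reported by the authors.

```python
# Values as reported in the text above
rmse = {"LM-ANN": 9.5223, "BR-ANN": 9.6555, "SCG-ANN": 11.0165, "MLR": 11.7585}
r2 = {"LM-ANN": 0.8164, "BR-ANN": 0.8118, "SCG-ANN": 0.7551, "MLR": 0.7201}

# Rank models from best (lowest RMSE) to worst
ranking = sorted(rmse, key=rmse.get)

# Percentage RMSE reduction of each ANN relative to the MLR baseline
baseline = rmse["MLR"]
reduction_pct = {model: 100.0 * (baseline - value) / baseline
                 for model, value in rmse.items() if model != "MLR"}

for model in ranking:
    print(f"{model:8s} RMSE={rmse[model]:.4f} R2={r2[model]:.4f}")
print({model: round(pct, 1) for model, pct in reduction_pct.items()})
```

The ordering by RMSE matches the ordering by R2, so both metrics point to the same ranking of the models.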
The correlation between the observed values and the estimates of the LM-ANN, BR-ANN, SCG-ANN, and MLR models is illustrated in Fig. 4. Additionally, Figs. 5, 6, and 7 show the regression plots for the LM-ANN, BR-ANN, and SCG-ANN models, respectively. Moreover, Fig. 8 provides a time series of the observed and estimated PM2.5 values for the proposed models.
Notably, the results highlighted the superior performance of the LM-ANN model in comparison to the other models, signifying its enhanced capability in estimating PM 2.5 concentrations.
The current investigation explores the effectiveness of various ANN techniques applied to air quality modeling, including their relative strengths and weaknesses in that context. The limitations of the proposed ANN models include the consistency and size of the data used, along with the need to maintain identical controlling factors for the training and testing data. Examining these diverse ANN approaches gives a deeper understanding of how they perform and contribute to the field of air quality modeling.

The models were based on data concerning SO2, NO2, and PM10 concentrations. The three distinct ANN models, namely LM-ANN, BR-ANN, and SCG-ANN, were applied to India's air quality dataset for the year 2021 sourced from the CPCB. The error metrics, specifically Root Mean Square Error (RMSE) and R-squared (R2), were employed to assess the performance of these models. The findings demonstrated that the LM-ANN model performed better than the other two ANN models and the Multiple Linear Regression (MLR) model. Moreover, these models have the potential to alert the public when the PM concentration surpasses its prescribed level. Furthermore, the suggested models can be deployed to forecast real-time air quality trends using historical data, making them valuable tools for proactive planning and management of air pollution concerns. In summary, this ANN modeling approach offers a practical solution for governmental agencies to address air pollution issues and formulate effective strategies for mitigating their impact.
Regarding future research directions, we aim to extend our investigations into air pollution by integrating daily and hourly data, thereby enabling a more exhaustive analysis of pollution levels across diverse urban areas. Additionally, due to the strong performance of the LM-ANN model, there is potential for further enhancements to fine-tune its capabilities in air quality prediction.

Figure 2. Correlation matrix of air pollutants in India for the year 2021.

Figure 3. The structure of the ANN layers.
Conclusion and future scope
Particulate matter (PM) is a major air pollutant known to have detrimental impacts on human health. This study involved the predictive analysis of PM2.5 levels by utilizing Artificial Neural Network (ANN) models.

Figure 8. Comparison of the observed and estimated values of PM2.5.
Air quality across the country is systematically monitored through a combination of manual and continuous ambient air quality monitoring stations. At present, this network comprises a total of 1257 monitoring stations. Manual monitoring is undertaken at 883 stations, covering 378 cities and towns across 28 States and 7 Union Territories. Continuous monitoring is carried out at 374 stations, situated in 190 cities and towns across 27 States and 4 Union Territories. Responsibility for monitoring air pollutants is shared with the State Pollution Control Boards (SPCB), Pollution Control Committees (PCC), and other reputable institutions. The CPCB collaborates with these organizations to ensure the uniformity and consistency of air quality data while offering technical and financial support. It identifies and quantifies pollutants as well as atmospheric parameters.

Scientific Reports | (2023) 13:22578 | https://doi.org/10.1038/s41598-023-49717-7

Table 1. Statistical error indices. Bold values indicate the best value.